Sending and Receiving Files

Not all calculations can return their result via the remotemanager syntax. For this reason, Dataset also allows you to work with files, providing the extra_files_send and extra_files_recv hooks.

These perform as you would expect: a Dataset which has extra files to send will attempt to grab those files and send them with each run. Likewise, if extra files are specified to be received, a fetch_results() call will attempt to fetch those files too.

Let's start with a function which merges two files to demonstrate this:

[2]:
from remotemanager import Dataset

def merge(fpath_a, fpath_b, fpath_c):
    with open(fpath_a, 'r') as o:
        data_a = o.read()

    with open(fpath_b, 'r') as o:
        data_b = o.read()

    with open(fpath_c, 'w+') as o:
        o.write(data_a + '\n' + data_b)

merge_files = Dataset(merge, skip=False)

Now that we have our function, we need to create some files to send and merge:

[3]:
with open('temp_file_a.txt', 'w+') as o:
    o.write('hello, world!')

with open('temp_file_b.txt', 'w+') as o:
    o.write('add me to the output!')
[4]:
args = {'fpath_a': 'temp_file_a.txt',
        'fpath_b': 'temp_file_b.txt',
        'fpath_c': 'output.txt'}

merge_files.append_run(args = args,
                       extra_files_send = ['temp_file_a.txt', 'temp_file_b.txt'],
                       extra_files_recv = ['output.txt'])
appended run runner-0

Now run and collect our results:

[5]:
merge_files.run()
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 7 Files in 2 Transfers... Done
Remotely executing 1/1 Runners
[5]:
True
[6]:
merge_files.wait(1, timeout=10)
merge_files.fetch_results()
Fetching results
Transferring 3 Files... Done

Let's see what’s in the results, and if the file has been returned as expected:

[7]:
print(merge_files.results)
[None]
[8]:
with open(f'{merge_files.local_dir}/output.txt', 'r') as o:
    print(o.read())
hello, world!
add me to the output!

Looks like it worked. Since the function itself does not return anything, we see None in the results.

File Paths

When using this feature, it’s important to pay attention to the locations of your files, as it’s easy to get confused.

extra_files_send bases its locations on the current working directory from which the dataset is run. When the run() command was issued for this example, the Dataset will have looked within os.getcwd() for the files temp_file_a.txt and temp_file_b.txt. In short, it operates between pwd and the remote dir.

extra_files_recv is slightly different, operating between the local_dir and the remote directory. This can be seen in the example above: output.txt is dropped into the local_dir rather than the directory where the input files were sourced.
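
To make the distinction concrete, here is a minimal sketch (using the merge_files Dataset from above) which prints both base locations:

[ ]:
import os

# extra_files_send resolves relative paths against the current working directory
print("send base:", os.getcwd())

# extra_files_recv resolves relative paths against the Dataset local_dir
print("recv base:", merge_files.local_dir)
print("fetched file:", os.path.join(merge_files.local_dir, "output.txt"))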

Fine Control

Added in version 0.12.3.

If the standard behaviour of files being sent between the working dir and remote dir isn't to your liking, there are other options.

While slightly more complex in terms of syntax, these give you fine control over your file locations.

Dict control

One way to do this is to specify your listings as dictionaries. This takes the form:

[9]:
extra_files_send = [{"local/path/to/file.txt": "path/to/target"}]

Note

It is assumed that the file name will be identical on the remote and local sides. If you need to change the name, you should do so within your Function.

In this case, it tells remotemanager that the extra file file.txt can be found in the directory local/path/to/, and that we want it to be sent to a directory named path/to/target relative to the Dataset remote_dir.
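
If you do need a different name on one side, a simple approach is to rename within the function itself. A minimal sketch, assuming the function receives the file under its original name (the target name renamed_input.txt is purely illustrative):

[ ]:
import shutil

def process(file):
    # hypothetical: rename the received file before working with it
    shutil.move(file, "renamed_input.txt")

    with open("renamed_input.txt") as o:
        return o.read()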

Paths

Note that the target directory is relative to the Dataset.remote_dir property, unless you specify an absolute path.

If we assume that we have a Dataset with the remote_dir set to remote_run, then we can send file.txt to remote_run/inner_dir using:

[10]:
extra_files_send = [{"file.txt": "inner_dir"}]

However:

[11]:
extra_files_send = [{"file.txt": "/home/user/run_data"}]

Would send file.txt to /home/user/run_data.

Demonstration

We can demonstrate this with a simple function that reads the contents of a file.

[12]:
def read(file):
    with open(file) as o:
        return o.read()

def create_file(fname):
    with open(fname, "w+") as o:
        o.write("foo")
[13]:
ds = Dataset(read, name="read", skip=False)

The following setup is the same as the standard behaviour. We can print the intended remote directory of the extra file by accessing its remote property.

Note

Internally, your file specs are converted into a list of TrackedFile objects, so all the methods available to these can be used here.

[14]:
create_file("tmp_standard.txt")

ds.append_run({"file": "tmp_standard.txt"}, extra_files_send=[{"tmp_standard.txt": ""}])
appended run runner-0
[15]:
print("Remote path for standard file:", ds.runners[0].extra_files_send[0].remote)  # print the remote, for debugging
Remote path for standard file: temp_runner_remote/tmp_standard.txt

Note

To send to the remote_dir you can use the empty string "" or the “current dir” shortcut ".".
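
For example, these two specifications both send the file directly to the remote_dir:

[ ]:
extra_files_send = [{"tmp_standard.txt": ""}]   # empty string
extra_files_send = [{"tmp_standard.txt": "."}]  # "current dir" shortcut, equivalent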

To send the file to a directory within the remote_dir, we can use this setup:

[16]:
create_file("tmp_dir.txt")

ds.append_run({"file": "inner_dir/tmp_dir.txt"}, extra_files_send=[{"tmp_dir.txt": "inner_dir"}])
appended run runner-1
[17]:
print("Remote path for inner_dir file:", ds.runners[1].extra_files_send[0].remote)
Remote path for inner_dir file: temp_runner_remote/inner_dir/tmp_dir.txt

Otherwise, we can send the file to an arbitrary directory, if we know the abspath.

Note

This will add a level of machine dependence to your run, as remotemanager expects that this path is valid and exists.

[18]:
import os

# create path using $HOME to allow testing
home = os.path.expandvars("$HOME")

path = os.path.join(home, "test")
file = os.path.join(path, "tmp_abs.txt")
[20]:
create_file("tmp_abs.txt")

ds.append_run({"file": file}, extra_files_send=[{"tmp_abs.txt": path}])
appended run runner-2
[21]:
print("Remote path for abspath file:", ds.runners[2].extra_files_send[0].remote)
Remote path for abspath file: /home/test/test/tmp_abs.txt
[22]:
ds.run()

ds.wait(1, 10)
Staging Dataset... Staged 3/3 Runners
Transferring for 3/3 Runners
Transferring 12 Files in 4 Transfers... Done
Remotely executing 3/3 Runners

If we collect the results, we should see that all the files have been read in by the function.

[23]:
ds.fetch_results()
ds.results
Fetching results
Transferring 6 Files... Done
[23]:
['foo', 'foo', 'foo']

TrackedFile

Internally, all extra files are converted to TrackedFile instances. This means you also have the option of specifying them directly, if you prefer. Let's add an extra runner which demonstrates this behaviour:

[24]:
from remotemanager.storage import TrackedFile

tfile = TrackedFile(".", ds.remote_dir, "trackedfile.txt")

# we can now use the write method of the TrackedFile class to add content to this file
tfile.write("foo, tracked")

ds.append_run({"file": tfile.name}, extra_files_send = [tfile])
appended run runner-3

Running this dataset again will run the new runner, showing the new file with its content:

[25]:
ds.run()
ds.wait(1, 10)
Staging Dataset... Staged 1/4 Runners
Transferring for 1/4 Runners
Transferring 6 Files in 2 Transfers... Done
Remotely executing 1/4 Runners
[26]:
ds.fetch_results()
ds.results
Fetching results
Transferring 2 Files... Done
[26]:
['foo', 'foo', 'foo', 'foo, tracked\n']

Important

The key point when setting up a TrackedFile is the argument order: TrackedFile(local_dir, remote_dir, filename). This sets up a file-like entity that allows remotemanager to “track” the file between local_dir and remote_dir.
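
As a quick sketch of that argument order (the directory names here are placeholders):

[ ]:
from remotemanager.storage import TrackedFile

# TrackedFile(local_dir, remote_dir, filename)
sketch = TrackedFile("local_data", "remote_data", "file.txt")

print(sketch.remote)  # remote_data/file.txt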

Retrieving Files

Collecting files from your runs follows the same syntax. Keep in mind that the value of the dictionary is the remote specification, and that the filename must go in the key.

[27]:
extra_files_recv = [{"local/path/to/file.txt": "remote_path"}]

This would fetch "file.txt" from temp_runner_remote/remote_path/file.txt, and move it to local/path/to/file.txt.
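
As with sending, this plugs directly into append_run. A hypothetical usage, fetching output.txt from a results directory on the remote into a local fetched directory (args stands in for whatever arguments your function needs):

[ ]:
ds.append_run(args,
              extra_files_recv=[{"fetched/output.txt": "results"}])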

Note

Just like with sending, you can also use abspaths here.